The following tutorials teach how to run NCBI [legacy-blast and] BLAST+ from the command line along with useful AWK, Bash and Perl idioms and 1-liner programs to effectively parse results.
This tutorial expects that the user has a Unix/Linux environment with the standard bash shell installed, as well as the legacy- and blast+ software suites, which are freely available for download from the NCBI ftp server
For further instruction on the installation and usage of the BLAST+ suite, read NCBI’s BLAST Book and NCBI’s The Statistics of Sequence Similarity Scores tutorial.
This material is distributed from the following TIB-filoinfo GitHub repository
This
work is licensed under a
Creative
Commons Attribution-NonCommercial 4.0
Remember that to run a BLAST SEARCH we need to select the proper blast program, based on the nature of our query sequence(s) and that of the database to be searched for homologues.
| query | database | program | note |
|---|---|---|---|
| DNA | DNA | blastn | - |
| PROT | PROT | blastp | - |
| DNA | PROT | blastx | The DNA query is translated in the 6 frames and the products searched against protein DB |
| PROT | DNA | tblastn | The DNA DB (genome) is translated in the 6 frames and queried with protein sequences |
| DNA | DNA | tblastx | The DNA query and DNA DB are both translated in the 6 frames and searches performed at the protein level |
Figure 1. The four standard BLAST programs.
In 1990, Samuel Karlin and Stephen Altschul published a paper describing the statistics underlying local alignment scores, based on the hit expectancy value or E-value.
\[ E = kmne^{-{\lambda}S} \] This equation states that the number of alignments expected by chance (\(E\)) during a blast search for a given high-scoring pair (HSP) with a raw score (\(S\)), is a function of the size of the search space (effective query sequence length \(m\) * effective length of the database \(n\) (in number of residues)), the normalized score of the hit (\({\lambda}*S\)), and a minor constant \(k\).
The following formula provides the relationship between both values:
\[P = 1 -e^{-E}\]
Note that for values \(<0.001\), the E-value and the P-value for a given HSP are essentially identical.
The following code will setup our working directories for the tutorials.
# i) download or copy the sequence data to your working directories for this practice
cd && [ ! -d sesion2_blast ] && mkdir sesion2_blast && cd sesion2_blast
# ii) Copy the data from my home directory
cp /home/vinuesa/cursos/intro_biocomp/sesion2_blast/16S_4blastN.tgz .
cp /home/vinuesa/cursos/intro_biocomp/sesion2_blast/gene_discovery_and_annotation_using_blastx.tgz .
# or use wget to fetch the directly from the command line
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion3_BLAST/data/16S_4blastN.tgz
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion3_BLAST/data/gene_discovery_and_annotation_using_blastx.tgz
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion3_BLAST/data/UniProt_proteomes.tgz
# iii) Generate two subdirectories to hold the data and analysis results
mkdir BLAST_DB_AA BLAST_DB_NT UniProt_proteomes
# iv) Save the path to our parental working directory in a variable
wkdir=$(pwd)
echo $wkdir
# v) Move each *tgz file to the proper dir
mv 16S_4blastN.tgz BLAST_DB_NT
mv UniProt_proteomes.tgz UniProt_proteomes
mv gene_discovery_and_annotation_using_blastx.tgz BLAST_DB_AA
Now we are set to start the hands-on tutorials on running BLAST from the command line
The aim of the exercise is to learn the basics of a BLAST analysis. We will generate an indexed BLAST database of nucleotide sequences and interrogate it using blastn.
We will use the following data for the exercise.
The objective of the exercise is to classify the query sequences in 16S_problema.fna at the genus level based on BLAST statistics.
Move into the directory holding the nucleotide sequences and explore both files using standard shell filtering tools and modify the FASTA headers in 16S_seqs4_blastDB.fna to make them suitable for formatdb|makeblastdb indexing
# We need to untar & uncompress (gunzip) the compressed tar file
cd BLAST_DB_NT/ && tar -xvzf 16S_4blastN.tgz
# 1) explore the fasta headers:
grep '>' 16S_seqs4_blastDB.fna |less
# 1.1) How many genera and species per genus are found in the source file 16S_seqs4_blastDB.fna?
# i. the genera
grep '>' 16S_seqs4_blastDB.fna | cut -d_ -f1 | sort | uniq -c | sort -nrk1
# ii. the species
grep '>' 16S_seqs4_blastDB.fna | cut -d_ -f1,2 | sort | uniq -c | sort -nrk1
# 1.2) Are the sequences in a suitable format for makeblastdb indexing?
# What command would you use to inspect the sequences?
# 1.2.1) generate a proper FASTA header for db indexing
perl -pe 'if( /^>/ ){ $c++; s/>/>gnl|16S_DB|$c / }' 16S_seqs4_blastDB.fna | grep '^>'
perl -pe 'if( /^>/ ){ $c++; s/>/>gnl|16S_DB|$c / }' 16S_seqs4_blastDB.fna > 16S_seqs4_blastDB.fnaed
Note: the commented command lines in the following blocks provide the code to run legacy blast. The exercises are demonstrated using the blast+ suite of programs
#formatdb -i 16S_seqs4_blastDB.fnaed -p F -o T
makeblastdb -h
makeblastdb -help
makeblastdb -in 16S_seqs4_blastDB.fnaed -dbtype nucl -parse_seqids
# 2.1) run blastn (blastall -p blastn), and get only the best hit (-b 1) in tabular format (-m 8).
# blastall -p blastn -i 16S_problema.fna -b 1 -a 6 -d 16S_seqs4_blastDB.fnaed -m 8 > 16S_problema_blastN_b1_m8.tab
blastn -query 16S_problema.fna -db 16S_seqs4_blastDB.fnaed -outfmt 6 -num_alignments 1 > 16S_problema_blastN_maln1_m6.tab
# 2.2) explore the table structure
head 16S_problema_blastN_maln1_m6.tab
13 16S_DB:480 99.485 1359 5 2 1 1359 48 1404 0.0 2475 18 16S_DB:494 99.033 1344 10 3 1 1344 49 1389 0.0 2416 2 16S_DB:549 99.552 1340 5 1 4 1343 81 1419 0.0 2440 27 16S_DB:546 100.000 1342 0 0 1 1342 78 1419 0.0 2479 36 16S_DB:480 99.705 1358 3 1 1 1358 48 1404 0.0 2490 4 16S_DB:600 99.335 1354 4 4 1 1353 79 1428 0.0 2446 48 16S_DB:484 98.375 1354 20 1 2 1355 48 1399 0.0 2390 53 16S_DB:548 99.925 1342 1 0 1 1342 78 1419 0.0 2473 62 16S_DB:496 99.106 1342 8 4 1 1342 85 1422 0.0 2409 7 16S_DB:600 98.736 1345 14 3 1 1343 79 1422 0.0 2386
As can be seen in the output above, the -outfmt 6
tabular format has no header.
The default -outfmt 6 blast output contains the
following 12 fields
# The default m6 blast output fields #===================================================================================== # >>> The default -outfmt 6 | -m 8 blast output contains the following 12 fields <<< # qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore #------------------------------------------------------------------------------------- # 1. qseqid query (e.g., gene) sequence id # 2. sseqid subject (e.g., reference genome) sequence id # 3. pident percentage of identical matches # 4. length alignment length # 5. mismatch number of mismatches # 6. gapopen number of gap openings # 7. qstart start of alignment in query # 8. qend end of alignment in query # 9. sstart start of alignment in subject #10. send end of alignment in subject #11. evalue expect value #12. bitscore bit score #-------------------------------------------------------------------------------------
# 2.3 Add a header to the standard output; print to file and append to 16S_problema_blastN_maln1_m6.tab
echo -e "qseqid\tsseqid\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tsstart\tsend\tevalue\tbitscore" > standard_blast_fields.tsv
cat standard_blast_fields.tsv 16S_problema_blastN_maln1_m6.tab > t && mv t 16S_problema_blastN_maln1_m6.tab
head -20 16S_problema_blastN_maln1_m6.tab | column -t
# 2.4) how many unique hits are there?
cut -f 2 16S_problema_blastN_maln1_m6.tab | sed '1d' | sort | uniq -c
# Generate the list of hit IDs for each query
# NOTE: the hits are in the same order as the query sequences
cut -f2 16S_problema_blastN_maln1_m6.tab | sed '1d' > IDs4blastdbcmd.txt
#blastdbcmd -i IDs4blastdbcmd.txt -d 16S_seqs4_blastDB.fnaed | grep '>' |cut -d\| -f3 | cut -d' ' -f2 > hit_sequences_b1.fna
blastdbcmd -entry_batch IDs4blastdbcmd.txt -db 16S_seqs4_blastDB.fnaed > best-hit_sequences.fna
# 1) Generate a list of the hits from the FASTA headers
grep '>' best-hit_sequences.fna | cut -d' ' -f2 > hit_sequence_headers.txt
# 2) Classify our query sequences, based on the subject headers of each best hit, using a simple tabular output format
cut -f1 16S_problema_blastN_maln1_m6.tab | sed '1d' > problem_seqs.txt
paste problem_seqs.txt hit_sequence_headers.txt > classified_16S_problema_sequences.tab
head -5 classified_16S_problema_sequences.tab
13 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 18 Mycobacterium_smegmatis_T|ATCC_19420|AJ131761 2 Mycobacterium_peregrinum_T|CIP_105382T|AY457069 27 Mycobacterium_fortuitum_T|CIP_104534T|AY457066 36 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928
A more efficient and idiomatic (“Bashish”) code to achieve the same
result, without having to write intermediate files, is the use of
“process substitution”, which has this general syntax:
<(any comamnd(s)) or pipeline here). Process
substitution replaces the command with its output, treating it as if it
were stored in a file. This output can therefore be used by commands
like \(cat\), \(paste\) or any other that works on files.
Lets see it in action:
paste <(cut -f1 16S_problema_blastN_maln1_m6.tab | sed '1d') <(grep '>' best-hit_sequences.fna | cut -d' ' -f2) | head | column -t
13 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 18 Mycobacterium_smegmatis_T|ATCC_19420|AJ131761 2 Mycobacterium_peregrinum_T|CIP_105382T|AY457069 27 Mycobacterium_fortuitum_T|CIP_104534T|AY457066 36 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 4 Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270 48 Mycobacterium_gordonae_T|ATCC_14470.|X52923 53 Mycobacterium_neworleansense_T|ATCC_49404|AY457068 62 Mycobacterium_chitae_T|ATCC_19627|X55603 7 Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270 ... TRUNCATED
The previous result looks great, but misses key statistical information about the robustness of the hits. Lets use the “process substitution” trick that we have just learned in the previous example to add some extra fields from the blast output table (pident, length, e-value & bit score) to our summary table, including a header.
# 3) add also columns 3, 4, 11, 12 (pident, length,e-value & bit score) to the tabular output, including a header
cat <(echo -e "sid\tpid\tlen\tEval\tscore\ttax") <( paste <(cut -f1,3,4,11,12 16S_problema_blastN_maln1_m6.tab | sed '1d') <(grep '>' best-hit_sequences.fna | cut -d' ' -f2) | head) | column -t
sid pid len Eval score tax 13 99.485 1359 0.0 2475 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 18 99.033 1344 0.0 2416 Mycobacterium_smegmatis_T|ATCC_19420|AJ131761 2 99.552 1340 0.0 2440 Mycobacterium_peregrinum_T|CIP_105382T|AY457069 27 100.000 1342 0.0 2479 Mycobacterium_fortuitum_T|CIP_104534T|AY457066 36 99.705 1358 0.0 2490 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 4 99.335 1354 0.0 2446 Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270 48 98.375 1354 0.0 2390 Mycobacterium_gordonae_T|ATCC_14470.|X52923 53 99.925 1342 0.0 2473 Mycobacterium_neworleansense_T|ATCC_49404|AY457068 62 99.106 1342 0.0 2409 Mycobacterium_chitae_T|ATCC_19627|X55603 7 98.736 1345 0.0 2386 Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270 ... TRUNCATED
It is important to realize that finding a hit in the database does not necessarily mean that it supports a reliable identification or classification of the query sequence. In order to have a reasonably high confidence in the classification of 16S rDNA sequences at the genus level, based on a best \(blastn\) hit, it should at least satisfy the following conditions:
Lets first explore the BLAST table to identify hits that do not satisfy the two criteria defined above (pident >= 95% && length >= 500). \(AWK\) is ideal for this task.
# we could use any of the following lines to print the results, including a header
# cat <(cat standard_blast_fields.tsv) <(awk 'BEGIN{FS=OFS="\t"}$3 < 95 || $4 < 500' 16S_problema_blastN_maln1_m6.tab) | column -t
awk 'BEGIN{FS=OFS="\t"; print "qseqid\tsseqid\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tsstart\tsend\tevalue\tbitscore"}$3 < 95 || $4 < 500' 16S_problema_blastN_maln1_m6.tab | column -t
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore 1367SAA005_14Cip-FD1.b1.ab1_638 16S_DB:86 93.218 634 39 4 1 632 37 668 0.0 929 1367SAA005_A17Cip-FD1.b1.ab1_171 16S_DB:86 95.930 172 6 1 1 171 35 206 2.51e-76 278 1367SAA005_A6BSm-FD1.b1.ab1_477 16S_DB:86 98.745 478 5 1 1 477 37 514 0.0 848 1367SAA005_S13-FD1.b1.ab1_121 16S_DB:534 88.333 120 11 3 3 121 64 181 2.33e-35 141 1367SAA005_S2CbTm-FD1.b1.ab1_84 16S_DB:269 97.647 85 1 1 1 84 47 131 1.15e-36 145 1367SAA005_S4CbTm-FD1.b1.ab1_172 16S_DB:269 100.000 172 0 0 1 172 52 223 1.49e-88 318 1367SAA005_S6Sm-FD1.b1.ab1_347 16S_DB:184 99.143 350 0 3 1 347 40 389 0.0 627 1367SAA005_S9b-FD1.b1.ab1_142 16S_DB:3 97.887 142 3 0 1 142 40 181 1.60e-67 248 1367SAA005_Tm203-FD1.b1.ab1_491 16S_DB:128 97.942 486 9 1 3 487 52 537 0.0 841 1367SAA005_VNP1-FD1.b1.ab1_578 16S_DB:435 94.965 576 28 1 3 578 31 605 0.0 902
In order to get a confident blastn-based classification of our
problem sequences at the genus level, we should use only hits with
pident >= 95% and a minimal alignment
length >= 500 bp.
paste <(awk 'BEGIN{FS=OFS="\t"}NR > 1{print $1,$3,$4,$11,$12}' 16S_problema_blastN_maln1_m6.tab) <(grep '>' best-hit_sequences.fna | cut -d' ' -f2) | awk 'BEGIN{FS=OFS="\t"; print "sid\tpid\tlen\tEval\tscore\tbest_hit"}$2 >= 95 && $3 >= 500' > 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv
Display the table’s heat and tail 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv
cat <(head 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv) <(tail 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv) | column -t
sid pid len Eval score best_hit 13 99.485 1359 0.0 2475 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 18 99.033 1344 0.0 2416 Mycobacterium_smegmatis_T|ATCC_19420|AJ131761 2 99.552 1340 0.0 2440 Mycobacterium_peregrinum_T|CIP_105382T|AY457069 27 100.000 1342 0.0 2479 Mycobacterium_fortuitum_T|CIP_104534T|AY457066 36 99.705 1358 0.0 2490 Mycobacterium_nonchromogenicum_T|ATCC_19530.|X52928 4 99.335 1354 0.0 2446 Mycobacterium_paraffinicum_T|ATCC_12670|GQ153270 48 98.375 1354 0.0 2390 Mycobacterium_gordonae_T|ATCC_14470.|X52923 53 99.925 1342 0.0 2473 Mycobacterium_neworleansense_T|ATCC_49404|AY457068 62 99.106 1342 0.0 2409 Mycobacterium_chitae_T|ATCC_19627|X55603 1367SAA005_S11Sm-FD1.b1.ab1_1049 98.763 1051 0.0 1866 Pseudomonas_monteilii_T|CIP_104883|AF064458 1367SAA005_S12-FD1.b1.ab1_899 98.210 894 0.0 1559 Microbacterium_testaceum_T|DSM_20166|X77445 1367SAA005_S15-FD1.b1.ab1_1111 98.115 1114 0.0 1936 Microbacterium_testaceum_T|DSM_20166|X77445 1367SAA005_S1CbTm-FD1.b1.ab1_683 99.227 647 0.0 1166 Pseudomonas_monteilii_T|CIP_104883|AF064458 1367SAA005_S1-FD1.b1.ab1_542 99.815 542 0.0 996 Pseudomonas_plecoglossicida_T|FPC951|AB009457 1367SAA005_S7CbTm-FD1.b1.ab1_961 98.954 956 0.0 1709 Citrobacter_freundii_T|DSM_30039|AJ233408 1367SAA005_S9-FD1.b1.ab1_1091 99.451 1092 0.0 1984 Cellulosimicrobium_funkei_T|W6122|AY501364 1367SAA005_VNC15-FD1.b1.ab1_1062 99.052 1055 0.0 1890 Rhizobium_alamii_T|type_strain:GBV016|AM931436 1367SAA005_VRP15_bis-FD1.b1.ab1_1124 99.822 1125 0.0 2065 Ensifer_terangae_T|type_strain:_LMG_7834|X68388 1367SAA005_VRP15-FD1.b1.ab1_1112 99.012 1113 0.0 1993 Ensifer_terangae_T|type_strain:_LMG_7834|X68388
The previous examples illustrated how to combine a set of \(Shell\) and \(Linux\) tools for efficient processing of BLAST output tables. With a little practice, you will soon master these power tools.
Finally, let’s generate some simple summary statistics of the classified sequences.
# 6 How many sequences could be classified with high confidence?
sed '1d' 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv | wc -l # 42
42
# 7 How many sequences per genus could be classified with high confidence?
sed '1d' 16S_problema_blastN_best-hit_classification_pidentGE95_alnGE500.tsv | cut -f6 | sed 's/_.*$//' | sort | uniq -c | sort -nrk1
14 Mycobacterium
5 Rhizobium
5 Ensifer
4 Pseudomonas
2 Microbacterium
2 Escherichia/Shigella
2 Beijerinckia
1 Sphingomonas
1 Sinorhizobium
1 Shinella
1 Isoptericola
1 Citrobacter
1 Cellulosimicrobium
1 Brevundimonas
1 Acinetobacter
Repeat the blastn-based classification exercise based on 16S rDNA sequences (section 7.1.4 and following sub-sections) using 10 hits per query. Impose the same filters applied in the previous exercise.
BLASTP is a key program in genomics and phylogenomics, as it is one of the main tools used to:
# Move into the UniProt_proteomes dir and extract the sequences in the tgz file
cd $wkdir && cd UniProt_proteomes
tar -xvzf UniProt_proteomes.tgz
The family of small Rab GTPases belongs to the Ras GTPase superfamily, which comprise over 150 human members, with evolutionarily conserved orthologs found throughout Eukaryota. The superfamily is divided into five major branches on the basis of sequence and functional similarities: Ras, Rho, Rab, Ran and Arf.
Small GTPases share a common biochemical mechanism and act as binary molecular switches, being active when GTP is bound and inactive when GDP is bound to the protein. Ras family proteins function as monomeric G proteins. Variations in structure and post-translational modifications that dictate specific subcellular locations and the proteins that serve as their regulators and effectors allow these small GTPases to function as sophisticated modulators of a remarkably complex and diverse range of cellular processes.
Rab GTPases are critical in membrane trafficking and insert into specific membrane compartments via their C-terminal geranylgeralyl lipidic tails, serving as organelle identity posts. In their GTP-bound (active) form, Rabs recruit multiple and specific effector proteins, which interact with membrane tethering (e.g. HOPS) and fusion factors (SNAREs), thereby serving a critical role in controlling the specific docking and fusion of membrane-bound compartments along the endocytic, recycling, exocytic, phagocytic and autophagic pathways.
In our research group at CCG-UNAM we use evolutionary, comparative, and functional genomics approaches to study Acanthamoeba castellanii and Stenotrophomonas spp. to identify essential loci and the molecular mechanisms that underlie the origins of complex phenotypes such as virulence, focusing on the cellular microbiology of Acanthamoeba-Stenotrophomonas interactions. We have recently found that Stenotrophomonas maltophilia, a globally emerging multidrug-resistant opportunistic pathogen is able to use Acanthamoeba trophozoites as an alternative replication niche.
Using a combination of bioinformatics, molecular and cell-biological approaches, we discrovered that Stenotrophomonas is resistant to digestion by A. castellanii, a free-living amoeba that we use in our lab as a proffesional phagocye model (Rivera et al. 2023, accepted). More specifically, we found that:
In the following exercise, we will use \(blastp\) to study the large family of Rab GTPases encoded by the Acanthamoeba castellanii genome, using the well-annotated and canonical RAB7A_HUMAN protein from as the query.
UniProt is the world’s leading high-quality, comprehensive and freely accessible resource of protein sequence and functional information. You should explore this extraordinary resource and read the following paper from the UniProt Consortium - UniProt: the Universal Protein Knowledgebase in 2023.
makeblastdb -in ACACA.faa -dbtype prot -parse_seqids
ACACA.faa ACACA.faa.pdb ACACA.faa.phr ACACA.faa.pin ACACA.faa.pog ACACA.faa.pos ACACA.faa.pot ACACA.faa.psq ACACA.faa.ptf ACACA.faa.pto DICDI.faa
The aim of this exercise is to identify A. castellanii Ras-superfamily members, using RAB7A_HUMAN as the query. Given the huge evolutionary distance separating humans from Amoebozoa (probably > 1.2 billion yrs), we are going to use the BLOSUM45 matrix.
Fig. 1 from Brocks et. al 2023. Nature 2023 Jun;618(7966):767-773. doi: 10.1038/s41586-023-06170-w.
Furthermore, since the Ras superfamily is expected to contain many members, possibly several hundreds, we are going to request 300 hits, using custom output fields in tabular format.
blastp -matrix BLOSUM45 -query RAB7A_HUMAN.faa -db ACACA.faa -outfmt '6 qseqid qlen sseqid slen stitle length pident gaps evalue bitscore qcovs' -num_alignments 300 > RAB7A_HUMAN_vs_ACACA_300hits.out
wc -l RAB7A_HUMAN_vs_ACACA_300hits.out
244
There are many (244) rows to explore. It is convenient to get simple summary statistics for key alignment parameters, such as: - E-value (column 9) - bit_score (column 10) - qcovs (column 11) - slen (column 4) - pident (column 7)
$HOME/bin directory#!/usr/bin/env bash
# AUTHOR: Pablo Vinuesa. CCG-UNAM. @pvinmex
# little awk script to compute basic summary statistics
# receives the output of cut for a numeric column in a table
sort -n | awk '
BEGIN{ max = min = Mode = 'NaN' }
{
sum+=$1
a[x++]=$1
b[$1]++
if(b[$1]>hf){hf=b[$1]}
}
NR == 1 { min=$1; max=$1 }
$1 < min { min= $1 }
$1 > max { max= $1 }
END{ n = asort(a);idx=int((x+1)/2)
print "Min: " min
print "Mean: " sum/x
print "Median: " ((idx==(x+1)/2) ? a[idx] : (a[idx]+a[idx+1])/2)
for (i in b){if(b[i]==hf){(k=="") ? (k=i):(k=k FS i)}{FS=","}}
print "Mode: " k
print "Max: " max
}
'
Alternatively, you may fetch it with the following command from my GitHub repo
cd ~/bin
wget -c https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/master/col_sumStats.sh
# takes you back again to your working directory
cd -
echo "# E-value"
cut -f 9 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh
echo
echo "# bit score"
cut -f 10 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh
echo
echo "# qcovs"
cut -f 11 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh
echo
echo "# pident"
cut -f 7 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh
echo
echo "# slen"
cut -f 4 RAB7A_HUMAN_vs_ACACA_300hits.out | col_sumStats.sh
# E-value Min: 6.25e-103 Mean: 0.250727 Median: 9.805e-15 Mode: 0.001 Max: 6.9 # bit score Min: 26.3 Mean: 72.9398 Median: 68.85 Mode: 27.5 Max: 292 # qcovs Min: 9 Mean: 68.6885 Median: 77 Mode: 78 Max: 100 # pident Min: 20.000 Mean: 31.7386 Median: 30.5675 Mode: 31.452 Max: 71.635 # slen Min: 94 Mean: 440.131 Median: 217.5 Mode: 193,195 Max: 2812
From the output above, we can conclude that there are many hits with very poor E-values (>= 0.001) and scores (< 50). There is also great variation in the subject sequence lenght (slen) and query coverage (qcovs).
awk -F"\t" '{l=$4; a[l]++; if (a[l]>max) max=a[l]} END {printf("Length\tFrequency"); for (i in a) if(a[i] >= 2){printf("\n%s\t%s\t",i,a[i]); for (j=0;j<(int(a[i]*(100/max)));j++) printf("*")} print ""}' RAB7A_HUMAN_vs_ACACA_300hits.out
Length Frequency 178 2 ********************************* 179 3 ************************************************** 180 3 ************************************************** 181 4 ****************************************************************** 184 5 *********************************************************************************** 186 5 *********************************************************************************** 188 3 ************************************************** 190 4 ****************************************************************** 193 6 **************************************************************************************************** 194 2 ********************************* 195 6 **************************************************************************************************** 196 3 ************************************************** 197 3 ************************************************** 198 3 ************************************************** 199 4 ****************************************************************** 200 3 ************************************************** 202 2 ********************************* 203 3 ************************************************** 204 3 ************************************************** 205 2 ********************************* 206 5 *********************************************************************************** 207 2 ********************************* 208 3 ************************************************** 209 2 ********************************* 210 3 ************************************************** 211 3 ************************************************** 213 3 ************************************************** 216 5 *********************************************************************************** 217 2 ********************************* 218 2 ********************************* 220 2 ********************************* 221 3 ************************************************** 223 2 ********************************* 228 4 ****************************************************************** 244 3 ************************************************** 264 2 ********************************* 266 2 ********************************* 273 2 ********************************* 421 2 ********************************* 514 2 ********************************* 558 2 ********************************* 578 2 ********************************* 988 2 *********************************
Seems that most homologs fall in the range between 178 and 250.
# save the subject-length column data to a file to be read by R
cut -f4 RAB7A_HUMAN_vs_ACACA_300hits.out > qlen.list
# call R
R
# read data
dfr <- read.csv("qlen.list", header=F)
# plot
hist(dfr$V1, breaks = "Freedman-Diaconis", xlim = c(0, 3000), probability=TRUE)
abline(v = median(dfr$V1), col = "blue", lwd = 3)
abline(v = mean(dfr$V1), col = "red", lwd = 3)
lines(density(dfr$V1), col = 'green', lwd = 3)
which generates the following graph:
Figure 4. Histogram and density plot of RAB7A_HUMAN \(blastp\) hit lengths in the A. castellanii proteome. The median and mean lengths are highlighted as blue and red vertical bars, respectively.
From the previous exploratory data analysis, we can conclude that the hits for likely canonical small Rab-GTPase family homologues should meet the following (relaxed) attributes:
- slen >= 178 && slen < 270 - E-value < 1e-10 - bit-score > 70 - qcovs > 72 - pident > 28
Filtering the table according to those criteria yields 80 hits:
awk 'BEGIN{FS=OFS="\t"; print "sseqid\tslen\tpident\tgaps\tevalue\tbitscore\tqcovs"} ($4 > 178 && $4 < 270) && $7 >= 28 && $9 < 1e-10 && $10 > 70 && $11 > 72 {print $3,$4,$7,$8,$9,$10,$11}' RAB7A_HUMAN_vs_ACACA_300hits.out | column -t -s $'\t'
sseqid slen pident gaps evalue bitscore qcovs tr|L8HHF3|L8HHF3_ACACA 186 71.635 23 6.25e-103 292 100 tr|L8GPH2|L8GPH2_ACACA 205 58.537 2 1.20e-86 251 99 tr|L8GXM9|L8GXM9_ACACA 197 60.736 2 5.09e-64 193 78 tr|L8GRY8|L8GRY8_ACACA 186 47.343 28 1.02e-58 179 98 tr|L8GLL4|L8GLL4_ACACA 206 50.000 4 4.58e-54 168 78 ... TRUNCATED tr|L8GLE5|L8GLE5_ACACA 193 28.302 3 9.76e-18 74.2 77 tr|L8H8M5|L8H8M5_ACACA 228 28.986 27 1.72e-17 74.2 90 tr|L8GIH3|L8GIH3_ACACA 198 29.944 11 3.36e-17 73.0 86 tr|L8GIM0|L8GIM0_ACACA 203 34.177 7 9.07e-17 71.9 75 tr|L8HCN9|L8HCN9_ACACA 194 31.447 32 1.04e-16 71.6 77
To further validate the selected hits as canonical small Rab-GTPases, these should be aligned and subjected to phylogenetic analysis along with reference sequences. We will perform this analysis in later sections of the course.
In the previous exercise we identified a selection of likely Rab-GTPase homologs in the A. castellanii proteome. Our next aim is to identify reciprocal best-hits or RBHs between a set of human reference Rabs retrieved from UNIPROT, and saved in file HUMAN_63Rabs_Ran.faa , and the A. castellanii proteome. The resulting RBHs are likely orthologs, and therefore we could transfer the annotation of the HUMAN Rabs to their Acanthamoeba RBHs.
The bioinformatic strategy to find RBHs is be implemented in the following steps:
Whenever possible use symlinks to avoid unnecessary file copying to preserve precious disk space
mkdir RBHs
cd RBHs/
# make symlinks whenever possible, to avoid unnecessary file copying to avoid filling the server's disks
ln -s ../ACACA.faa .
ln -s ../ACACA.faa.* .
ln -s ../HUMAN_63Rabs_Ran.faa .
Given the huge evolutionary distance between human and amoebae, we shall use the BLOSUM45 matrix to maximize sensitivity.
blastp -query HUMAN_63Rabs_Ran.faa -db ACACA.faa -matrix BLOSUM45 -outfmt 6 -num_alignments 10 > HUMAN_63Rabs_vs_ACACA_10hits_m6.out
blastdbcmd -entry_batch <(cut -f2 HUMAN_63Rabs_vs_ACACA_10hits_m6.out) -db ACACA.faa > ACACA_vs_HUMAN_best_Rab_hits.faa
makeblastdb -in HUMAN_63Rabs_Ran.faa -dbtype prot -parse_seqids
blastp -query ACACA_vs_HUMAN_best_Rab_hits.faa -db HUMAN_63Rabs_Ran.faa -matrix BLOSUM45 -outfmt 6 -num_alignments 10 > ACACA_vs_HUMAN_RBHs_m6.out
# The following lines extract a sorted list of unique Acanthamoeba identifiers
# found in the ACA_vs_HUMAN and HUMAN_vs_ACA searches
cat <(cut -f1 ACACA_vs_HUMAN_RBHs_m6.out) <(cut -f2 HUMAN_63Rabs_vs_ACACA_10hits_m6.out) | sort | uniq -c
cat <(cut -f1 ACACA_vs_HUMAN_RBHs_m6.out) <(cut -f2 HUMAN_63Rabs_vs_ACACA_10hits_m6.out) | sort -u
# grep out some of the Acanthamoeba hits from the HUMAN_63Rabs_vs_ACACA_10hits_m6.out
# and sort the resulting list by increasin E-value (on col 11)
grep -E 'L8GF16|L8GYY7|L8H946|L8HHF3' HUMAN_63Rabs_vs_ACACA_10hits_m6.out | sort -gk11,11
# This code greps out the sorted and uniique ACACA hits from the HUMAN_63Rabs_vs_ACACA_10hits_m6.out table
for ACAid in $(cat <(cut -f1 ACACA_vs_HUMAN_RBHs_m6.out) <(cut -f2 HUMAN_63Rabs_vs_ACACA_10hits_m6.out) | sort -u); do grep $ACAid HUMAN_63Rabs_vs_ACACA_10hits_m6.out; done
# Same as previous line, but sorting the blastp output table by increasing E-values (on column 11; sort -gk11,11)
# output displayed below
for ACAid in $(cat <(cut -f1 ACACA_vs_HUMAN_RBHs_m6.out) <(cut -f2 HUMAN_63Rabs_vs_ACACA_10hits_m6.out) | sort -u); do grep $ACAid HUMAN_63Rabs_vs_ACACA_10hits_m6.out; done | sort -gk11,11 > HUMAN_63Rabs_vs_ACACA_10hits_m6.out.sorted
sp|P62826|RAN_HUMAN L8GF16 82.524 206 36 0 11 216 8 213 7.64e-129 359 sp|P61106|RAB14_HUMAN L8H946 73.953 215 54 1 1 215 1 213 2.88e-117 329 sp|P51149|RAB7A_HUMAN L8HHF3 71.635 208 36 4 1 207 1 186 6.25e-103 292 sp|P61019|RAB2A_HUMAN L8H946 65.241 187 65 0 3 189 6 192 1.19e-90 261 sp|Q8WUD1|RAB2B_HUMAN L8H946 65.079 189 66 0 1 189 4 192 3.44e-90 260 sp|P61018|RAB4B_HUMAN L8H946 66.857 175 58 0 5 179 6 180 6.71e-89 257 sp|P20338|RAB4A_HUMAN L8H946 66.854 178 59 0 7 184 3 180 7.71e-88 254 sp|Q9NP72|RAB18_HUMAN L8GYY7 58.163 196 80 2 9 203 15 209 4.10e-80 234 sp|P51151|RAB9A_HUMAN L8HHF3 48.241 199 83 3 4 201 5 184 2.34e-62 188 sp|Q15907|RB11B_HUMAN L8H946 52.874 174 82 0 8 181 6 179 9.51e-62 188 sp|P62491|RB11A_HUMAN L8H946 49.474 190 96 0 8 197 6 195 1.23e-61 187 sp|Q9NP90|RAB9B_HUMAN L8HHF3 46.970 198 86 2 4 200 5 184 2.92e-60 183 ... TRUNCATED
Note: if you need help with AWK and Linux for bioinformatics, visit my detailed AWK tutorial for bioinformatics here
awk 'BEGIN{FS=OFS="\t"}{HUMANid[$1]++; ACAid[$2]++; if(HUMANid[$1] == 1 && ACAid[$2] == 1) print }' HUMAN_63Rabs_vs_ACACA_10hits_m6.out.sorted
awk 'BEGIN{FS=OFS="\t"}{HUMANid[$1]++; ACAid[$2]++; if(HUMANid[$1] == 1 && ACAid[$2] == 1) print }' HUMAN_63Rabs_vs_ACACA_10hits_m6.out.sorted > HUMAN_63Rabs_vs_ACACA_RBHs.tsv
sp|P62826|RAN_HUMAN L8GF16 82.524 206 36 0 11 216 8 213 7.64e-129 359 sp|P61106|RAB14_HUMAN L8H946 73.953 215 54 1 1 215 1 213 2.88e-117 329 sp|P62491|RB11A_HUMAN L8HJ43 75.829 211 51 0 3 213 2 212 9.90e-117 328 sp|Q8WUD1|RAB2B_HUMAN L8H2N2 70.046 217 63 2 1 216 1 216 4.48e-110 311 sp|P20340|RAB6A_HUMAN L8GN15 81.287 171 32 0 8 178 4 174 1.25e-104 296 sp|P51149|RAB7A_HUMAN L8HHF3 71.635 208 36 4 1 207 1 186 6.25e-103 292 sp|P61018|RAB4B_HUMAN L8HCK2 65.888 214 62 3 1 213 1 204 2.29e-102 291 sp|Q13637|RAB32_HUMAN L8HAC7 68.681 182 55 2 23 202 11 192 9.57e-94 270 sp|P62820|RAB1A_HUMAN L8HD31 69.412 170 52 0 7 176 3 172 4.80e-91 262 sp|Q9NP72|RAB18_HUMAN L8GYY7 58.163 196 80 2 9 203 15 209 4.10e-80 234 sp|Q92930|RAB8B_HUMAN L8HGB4 72.917 144 39 0 3 146 12 155 4.75e-80 233 sp|P20339|RAB5A_HUMAN L8H6F4 76.415 106 25 0 16 121 2 107 8.12e-59 178 sp|Q13636|RAB31_HUMAN L8GKG8 46.734 199 97 3 5 194 19 217 2.73e-58 178 sp|O95755|RAB36_HUMAN L8H339 47.926 217 101 6 82 294 14 222 3.82e-57 181 sp|Q969Q5|RAB24_HUMAN L8H2Q9 50.549 182 72 4 6 173 4 181 6.20e-56 172
AIMS & data:
Background info: The 3cass_amplicons.fna sequences were generated in our laboratory from diverse multidrug-resistant Enterobacteriaceae isolates recovered from polluted rivers in Morelos, Mexico. We used PCR to screen our collection of isolates for the presence of the intI integrase gene with the conserved HS458/HS459 primers, which is a hallmark of class-I integrons. Positive isolates were then subjected to PCR with the conserved CS-F & CS-R primers that amplify gene cassette arrays captured by integrons.
Figure 2. The structure of a class-I integron.
If you like, have a look at this nice video-protocol on Screening Foodstuffs for Class 1 Integrons and Gene Cassettes by Liette S. Waldron and Michael R. Gillings from the Department of Biological Sciences, Macquarie University. June 19, 2015 in J Vis Exp. 2015; (100): 52889. doi: 10.3791/52889.
# Move into the BLAST_DB_AA dir and extract the sequences in the tgz file
cd $wkdir && cd BLAST_DB_AA
tar -xvzf gene_discovery_and_annotation_using_blastx.tgz
# count sequences
grep -c integron_cassettes4blastdb.faa # 2297 sequences
# explore headers
grep '^>' integron_cassettes4blastdb.faa | head -20
>gi|17026024 |[Achromobacter xylosoxidans]|AXI2|||||||Japan:Maebashi||clinical strain; synonym:Alcaligenes xylosoxidans||blaIMP-10|metallo-beta-lactamase IMP-10|741|AB074435|Iyobe_S.|Antimicrob.AgentsChemother.46_2014-2016_2002|12019129 >gi|17026026 |[Acinetobacter baumannii]|ABI1|||||||Japan:Maebashi||clinical strain||blaIMP-11|metallo-beta-lactamase IMP-11|738|AB074436|unpubl >gi|34482004 |[Vibrio cholerae O1]||||||||||||aadA1|aminoglycoside adenyltransferase|792|AB107663|unpubl >gi|38524487 |[Vibrio cholerae non-O1/non-O139]|||||Non O1/Non O139|||||||dfrA15|dihydrofolate reductase|474|AB113114|unpubl >gi|51699488 |[Escherichia coli]|D49||feces|||||China: Guangzhou||patient||aadA1|aminoglycoside adenylyltransferase|792|AB189263|Su_J.|FEMSMicrobiol.Lett.254_75-80_2006|16451182 ...
# 1. change >gi|17026026 |[Acinetobacter baumannii]|ABI1... ==>
# >gnl|casseteDB|17026026 |[Acinetobacter baumannii]|ABI1...
# and
# 2. simplify the header a bit, to avoid the many empty fields
grep '>' integron_cassettes4blastdb.faa | head -1 | sed 's#|#\n#g' | cat -n
1 >gi
2 17026024
3 [Achromobacter xylosoxidans]
4 AXI2
5
6
7
8
9
10
11 Japan:Maebashi
12
13 clinical strain; synonym:Alcaligenes xylosoxidans
14
15 blaIMP-10
16 metallo-beta-lactamase IMP-10
17 741
18 AB074435
19 Iyobe_S.
20 Antimicrob.AgentsChemother.46_2014-2016_2002
21 12019129
# use an AWK one-liner to do the complete job. Use the list of |-delimited fields listed above to select some of the most relevant ones
# NOTE that we add an extra field in the following transformation: >gi|17026026 => >gnl|casseteDB|17026026;
# therefore we need to add one to the field numbers listed above in order to get the desired ones
awk 'BEGIN{FS=OFS="|"}/>/{sub(/^>gi/, ">gnl|casseteDB"); print $1,$2,$3,$4,$16,$17,$18,$19,$22}!/>/{print $1}' integron_cassettes4blastdb.faa | grep '^>' | head -5
>gnl|casseteDB|17026024 |[Achromobacter xylosoxidans]|blaIMP-10|metallo-beta-lactamase IMP-10|741|AB074435|12019129 >gnl|casseteDB|17026026 |[Acinetobacter baumannii]|blaIMP-11|metallo-beta-lactamase IMP-11|738|AB074436| >gnl|casseteDB|34482004 |[Vibrio cholerae O1]|aadA1|aminoglycoside adenyltransferase|792|AB107663| >gnl|casseteDB|38524487 |[Vibrio cholerae non-O1/non-O139]|dfrA15|dihydrofolate reductase|474|AB113114| >gnl|casseteDB|51699488 |[Escherichia coli]|aadA1|aminoglycoside adenylyltransferase|792|AB189263|16451182
# save the edited faa file as integron_cassettes4blastdb.faaED
awk 'BEGIN{FS=OFS="|"}/>/{sub(/^>gi/, ">gnl|casseteDB"); print $1,$2,$3,$4,$16,$17,$18,$19,$22}!/>/{print $1}' integron_cassettes4blastdb.faa > integron_cassettes4blastdb.faaED
# lets check the result a final time
grep '^>' integron_cassettes4blastdb.faaED | head -20
# 2) format DB with makeblastdb
# formatdb -i integron_cassettes4blastdb.faaED -p T -o T -n integron_cassetteDB # legacy blast
makeblastdb -in integron_cassettes4blastdb.faaED -dbtype prot -parse_seqids
Building a new DB, current time: 10/18/2022 19:04:31 New DB name: /home/vinuesa/cursos/intro_biocomp/sesion2_blast/BLAST_DB_AA/integron_cassettes4blastdb.faaED New DB title: integron_cassettes4blastdb.faaED Sequence type: Protein Deleted existing Protein BLAST database named /home/vinuesa/cursos/intro_biocomp/sesion2_blast/BLAST_DB_AA/integron_cassettes4blastdb.faaED Keep MBits: T Maximum file size: 1000000000B Adding sequences from FASTA; added 2297 sequences in 0.490659 seconds.
# List the output generated by makeblastdb, sorted by modification time in reverse manner
ls -ltr
-rw-r--r--. 1 vinuesa faculty 4505 ago 31 2011 3cass_amplicons.fna -rw-r--r--. 1 vinuesa faculty 924971 ago 31 2011 integron_cassettes4blastdb.faa -rw-r--r--. 1 vinuesa faculty 1129 jul 26 2019 unknown_prot.faa -rw-r--r--. 1 vinuesa faculty 751713 oct 10 20:15 gene_discovery_and_annotation_using_blastx.tgz -rw-r--r--. 1 vinuesa faculty 803081 oct 18 19:03 integron_cassettes4blastdb.faaED -rw-r--r--. 1 vinuesa faculty 566907 oct 18 19:04 integron_cassettes4blastdb.faaED.psq -rw-r--r--. 1 vinuesa faculty 9220 oct 18 19:04 integron_cassettes4blastdb.faaED.pog -rw-r--r--. 1 vinuesa faculty 18520 oct 18 19:04 integron_cassettes4blastdb.faaED.pin -rw-r--r--. 1 vinuesa faculty 329297 oct 18 19:04 integron_cassettes4blastdb.faaED.phr -rw-r--r--. 1 vinuesa faculty 9192 oct 18 19:04 integron_cassettes4blastdb.faaED.pto -rw-r--r--. 1 vinuesa faculty 27572 oct 18 19:04 integron_cassettes4blastdb.faaED.pot -rw-r--r--. 1 vinuesa faculty 16384 oct 18 19:04 integron_cassettes4blastdb.faaED.ptf -rw-r--r--. 1 vinuesa faculty 63027 oct 18 19:04 integron_cassettes4blastdb.faaED.pos -rw-r--r--. 1 vinuesa faculty 188416 oct 18 19:04 integron_cassettes4blastdb.faaED.pdb
The aim of the exercise is to assign functions or annotations to the problem proteins found in file unknown_prot.faa
# How many proteins sequences are stored in unknown_prot.faa
grep -c '^>' unknown_prot.faa
cat unknown_prot.faa
# How do we run blast? => explore the program's help output
blastp -h # short format
blastp -help # long format
# launch a standard blastp search, asking the program to provide a tabular format for the top 10 hits
blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt 6 -num_alignments 10
#>>> Lets add some interesting information to the table, missing in the standard -outfmt 6 output
# run blastp using outfmt 6 and passing a user-specified formatting string
blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt '6 qseqid sseqid qlen slen qstart qend sstart send length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 10 > blastp_10hits.out
# prepare a header file with the field names for easier interpretation of the blastp ouptput table
echo -e 'qseqid\tsseqid\tqlen\tslen\tqstart\tqend\tsstart\tsend\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs' > header1.tab
cat header1.tab
# add the header to the output table
cat header1.tab blastp_10hits.out > blastp_10hits.tsv
column -t blastp_10hits.tsv | head -20
qseqid sseqid qlen slen qstart qend sstart send length pident positive mismatch gaps evalue bitscore qcovs gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|87128428 266 267 1 266 1 266 266 100.000 266 0 0 0.0 537 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|145321080 266 267 1 266 1 266 266 99.624 266 1 0 0.0 536 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|44844209 266 267 1 266 1 266 266 99.624 266 1 0 0.0 536 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|102231612 266 267 1 266 1 266 266 99.248 266 2 0 0.0 535 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|38492196 266 267 1 266 1 266 266 90.226 256 26 0 0.0 497 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|118402723 266 267 1 266 1 266 266 89.850 255 27 0 0.0 494 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|66277418 266 267 1 266 1 266 266 89.098 252 29 0 3.06e-179 489 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|148455769 266 267 1 266 1 266 266 87.594 248 33 0 1.83e-176 483 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|30014072 266 266 1 266 1 265 266 73.684 223 69 1 6.02e-146 405 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|17026024 266 247 43 252 24 227 211 33.175 120 133 8 1.82e-37 128 79 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|56691997 390 387 5 390 1 386 386 99.482 384 2 0 0.0 789 99 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|146151072 390 383 5 388 1 382 388 45.361 251 202 10 2.45e-114 335 98 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|40287460 390 382 13 387 5 381 379 39.314 232 224 6 7.18e-96 287 96 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|133905118 390 410 13 387 33 409 379 39.578 231 223 6 5.97e-95 286 96 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|76057210 390 380 1 387 1 379 396 38.131 232 219 26 4.63e-88 267 99 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|146151112 390 276 316 380 99 165 67 22.388 34 50 2 2.4 25.8 17 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|133905089 390 126 4 37 2 35 34 38.235 22 21 0 5.8 23.9 9 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|161087277 390 80 133 201 15 78 79 25.316 34 34 25 6.9 23.1 18 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|133905132 390 101 255 287 71 98 33 36.364 18 16 5 7.4 23.1 8 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|133905029 390 337 100 154 247 287 55 21.818 24 29 14 7.8 24.3 14
As in the previous exercise on 16S rDNA sequence classification, for a confident annotation of a protein it is important to consider some basic alignment statistics.
In our particular case, since the query proteins correspond to gene cassettes encoding for antibiotic resistance, we are interested in identifying proteins down to the variant level. Therefore, we will impose very stringent filtering criteria to keep only hits with high pident and qcov.
# Print numbered list of fasta header fields to filter hits with pident >= 95% and qcovs >= 97%
sed 's/\t/\n/g' header1.tab | cat -n
1 qseqid
2 sseqid
3 qlen
4 slen
5 qstart
6 qend
7 sstart
8 send
9 length
10 pident
11 positive
12 mismatch
13 gaps
15 bitscore
16 qcovs
# Filter out hits with >= 95% pident && qcovs >= 97%
awk 'BEGIN{FS=OFS="\t"} $10 >= 95 && $16 >= 97' blastp_10hits.tsv | column -t
qseqid sseqid qlen slen qstart qend sstart send length pident positive mismatch gaps evalue bitscore qcovs gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|87128428 266 267 1 266 1 266 266 100.000 266 0 0 0.0 537 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|145321080 266 267 1 266 1 266 266 99.624 266 1 0 0.0 536 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|44844209 266 267 1 266 1 266 266 99.624 266 1 0 0.0 536 100 gi|11037699|gb|AAG27703.1|AF300454_1 gnl|casseteDB|102231612 266 267 1 266 1 266 266 99.248 266 2 0 0.0 535 100 gi|9294830|gb|AAF86698.1|AF180959_1 gnl|casseteDB|56691997 390 387 5 390 1 386 386 99.482 384 2 0 0.0 789 99
# add a subject title (stitle), to get the description line of the proteins we are finding as top hits
echo -e 'qseqid\tsseqid\tqlen\tslen\tstitle\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs' > header1.tab
blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt '6 qseqid sseqid qlen slen stitle length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 10 > blastp_10hits.out
cat header1.tab blastp_10hits.out > blastp_10hits.tsv
cut -f1,5,7,13 blastp_10hits.tsv | head -20 | column -ts $'\t'
qseqid stitle pident qcovs gi|11037699|gb|AAG27703.1|AF300454_1 |[Pseudomonas aeruginosa]|blaVIM-3|VIM-3 metallo-beta-lactamase|801|DQ342344| 100.000 100 gi|11037699|gb|AAG27703.1|AF300454_1 |[Pseudomonas putida]|blaVIM-6|metallo-beta-lactamase VIM-6|801|EF522838|18757180 99.624 100 gi|11037699|gb|AAG27703.1|AF300454_1 |[Pseudomonas aeruginosa]|blaVIM|class B beta-lactamase|801|AJ628983| 99.624 100 gi|11037699|gb|AAG27703.1|AF300454_1 |[Pseudomonas aeruginosa]|blaVIM-2|metallo-beta-lactamase VIM-2|801|DQ522233| 99.248 100 gi|11037699|gb|AAG27703.1|AF300454_1 |[Pseudomonas aeruginosa]|blaVIM-4|metallo-beta-lactamase VIM-4|801|AY460181| 90.226 100 gi|11037699|gb|AAG27703.1|AF300454_1 |[Pseudomonas aeruginosa]|blaVIM-1|VIM-1 metallo-beta-lactamase|801|AJ969237| 89.850 100 gi|11037699|gb|AAG27703.1|AF300454_1 |[Enterobacter cloacae]|blaVIM-5|metallo-beta-lactamase VIM-5|801|DQ023222|16189133 89.098 100 gi|11037699|gb|AAG27703.1|AF300454_1 |[Pseudomonas aeruginosa]|blaVIM-13|class B carbapenemase VIM-13|801|EF577407|18644957 87.594 100 gi|11037699|gb|AAG27703.1|AF300454_1 |[Pseudomonas aeruginosa]|blaVIM-7|metallo-b-lactamase|798|AJ536835|14693560 73.684 100 gi|11037699|gb|AAG27703.1|AF300454_1 |[Achromobacter xylosoxidans]|blaIMP-10|metallo-beta-lactamase IMP-10|741|AB074435|12019129 33.175 79 gi|9294830|gb|AAF86698.1|AF180959_1 |[Klebsiella pneumoniae]|acc-1|beta-lactamase|1161|AJ870923|16982793 99.482 99 gi|9294830|gb|AAF86698.1|AF180959_1 |[Klebsiella pneumoniae]||cmy-8|1149|EF382672|17526756 45.361 98 gi|9294830|gb|AAF86698.1|AF180959_1 |[Escherichia coli]|cmy-13|CMY-13|1146|AY339625|16048979 39.314 96 gi|9294830|gb|AAF86698.1|AF180959_1 |[Salmonella enterica subsp. enterica serovar Newport str. SL254]|blaCMY-2-1|CMY-2 beta-lactamase 1|1230|CP000604|17375195 39.578 96 gi|9294830|gb|AAF86698.1|AF180959_1 |[Klebsiella pneumoniae]|ampC|beta-lactamase class C DHA-1|1140|AJ971343|16436717 38.131 99 gi|9294830|gb|AAF86698.1|AF180959_1 |[Klebsiella pneumoniae]||sfpB|828|EF382672|17526756 22.388 17 gi|9294830|gb|AAF86698.1|AF180959_1 |[Salmonella enterica subsp. enterica serovar Newport str. SL254]||hypothetical protein|378|CP000604|17375195 38.235 9 gi|9294830|gb|AAF86698.1|AF180959_1 |[Salmonella enterica subsp. enterica serovar Choleraesuis]|orf20|Orf20|240|EU219534| 25.316 18 gi|9294830|gb|AAF86698.1|AF180959_1 |[Salmonella enterica subsp. enterica serovar Newport str. SL254]||conserved hypothetical protein|303|CP000604|17375195 36.364 8
#>>> retrieve the best hit found for each input gene
blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt '6 qseqid sseqid qlen slen stitle length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 1 > blastp_tophits.out
cat header1.tab blastp_tophits.out > blastp_tophits.tsv
cut -f1,5,7,13 blastp_tophits.tsv | head | column -ts $'\t'
qseqid stitle pident qcovs gi|11037699|gb|AAG27703.1|AF300454_1 |[Pseudomonas aeruginosa]|blaVIM-3|VIM-3 metallo-beta-lactamase|801|DQ342344| 100.000 100 gi|9294830|gb|AAF86698.1|AF180959_1 |[Klebsiella pneumoniae]|acc-1|beta-lactamase|1161|AJ870923|16982793 99.482 99 gi|15705413|gb|AAL05630.1|AF395881_1 |[Klebsiella pneumoniae]|bla|extended-spectrum beta-lactamase CTX-M-19|876|AF458080|12936998 52.158 95
We can easily retrieve the top-scoring hits from the database using blastdbcmd using a list of qseqids, as shown below.
# get the sseqids only for the top hits with % sequence identity >= 95 and query coverage >= 97%
awk 'BEGIN{FS=OFS="\t"}NR > 1 && $7 >= 95 && $13 >= 97{print $2}' blastp_tophits.tsv > sseqids_for_top_hits_with_ge95pident_ge97qcovs.list
# finally retrieve the sequences for the sseqids_for_top_hits_with_ge95pident_ge97qcovs.list from the integron_cassettes4blastdb
blastdbcmd -db integron_cassettes4blastdb.faaED -entry_batch sseqids_for_top_hits_with_ge95pident_ge97qcovs.list -dbtype prot -out top_hits_with_ge95pident_ge97qcovs.faa
cat top_hits_with_ge95pident_ge97qcovs.faa
>casseteDB:87128428 |[Pseudomonas aeruginosa]|blaVIM-3|VIM-3 metallo-beta-lactamase|801|DQ342344| MFKLLSKLLVYLTASIMAIASPLAFSVDSSGEYPTVSEIPVGEVRLYQIADGVWSHIATKSFDGAVYPSNGLIVRDGDEL LLIDTAWGAKNTAALLAEIEKQIGLPVTRAVSTHFHDDRVGGVDVLRAAGVATYASPSTRRLAEVEGSEIPTHSLEGLSS SGDAVRFGPVELFYPGAAHSTDNLVVYVPSASVLYGGCAIYELSRTSAGNVADADLAEWPTSIERIQQHYPEAQFVIPGH GLPGGLDLLKHTTNVVKAHTNRSVVE* >casseteDB:56691997 |[Klebsiella pneumoniae]|acc-1|beta-lactamase|1161|AJ870923|16982793 MQNTLKLLSVITCLAATVQGALAANIDESKIKDTVDDLIQPLMQKNNIPGMSVAVTVNGKNYIYNYGLAAKQPQQPVTEN TLFEVGSLSKTFAATLASYAQVSGKLSLDQSVSHYVPELRGSSFDHVSVLNVGTHTSGLQLFMPEDIKNTTQLMAYLKAW KPADAAGTHRVYSNIGTGLLGMIAAKSLGVSYEDAIEKTLLPQLGMHHSYLKVPADQMENYAWGYNKKDEPVHVNMEILG NEAYGIKTTSSDLLRYVQANMGQLKLDANAKMQQALTATHTGYFKSGEITQDLMWEQLPYPVSLPNLLTGNDMAMTKSVA TPIVPPLPPQENVWINKTGSTNGFGAYIAFVPAKKMGIVMLANKNYSIDQRVTVAYKILSSLEGNK*
If you read the detailed documentation of \(blastp\) blastp -help, you
will find under the output section that -outfmt 7 is a tabular with
comments.
Here a few lines of code to generate a standard tabular output with a
customized header, parsing the -outfmt 7 output. Note the
use of cmd <(command pipeline), a syntax called
process substitution, which allows to use the ouptut of \(command\ pipeline\) as if it were a file
for input into \(cmd\). This is
convenient, as it avoids the need of writing an intermediate file.
# 1. Run blast[p] with -outfmt 7 and the desired output fields
blastp -query unknown_prot.faa -db integron_cassettes4blastdb.faaED -outfmt '7 qseqid sseqid qlen slen stitle length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 10 > blast_m7.tmp
# 2. use process substitution to parse the -outfmt 7 output file blast_m7.tmp, and generate the standard tabular optput with labeled columns
cat <(awk '/^# Fields:/ {sub(/# Fields: /, ""); gsub(/, /, "\t"); gsub(/ /, "_"); print $0; exit}' blast_m7.tmp) <(awk '!/^# /{print $0 }' blast_m7.tmp) | less
The aim here is to identify the possible gene-cassettes encoded in the PCR amplicons collected in 3cass_amplicons.fna, as explained in the previous background info section.
BLASTX is generally used to find CDSs (protein-coding sequences) in genomic DNA or transcripts. As we will see, BLASTX is a powerful gene-finding tool, but we will need to examine many alignments to ensure that all CDSs encoded on a given DNA segment are found. For large DNA strings (>5-10 kb), I would recommend to segment the string into shorter, overlapping pieces, before running BLASTX
Remember that BLASTX uses a DNA query, which is translated in its 6 frames using by default the universal code, and the resulting products are used to interrogate a protein database.
# run blastx asking for the top 3 hits in tabular output format and an evalue cutoff set at 1e-100
# adding relevant information to the table, missing in the standard -m 6 output, like qlen, selen, qframe & qcovs
echo -e 'qseqid\tsseqid\tqlen\tslen\tqstart\tqend\tqframe\tsstart\tsend\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs' > header1.tab
blastx -query 3cass_amplicons.fna -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-100 -outfmt '6 qseqid sseqid qlen slen qstart qend qframe sstart send length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 3 -out 3blastX_hits_m6_header1.tab
# if 3blastX_hits_m6_header1.tab exists and has a non-zero size, then execute the following commands ...
[ -s 3blastX_hits_m6_header1.tab ] && cat header1.tab 3blastX_hits_m6_header1.tab > ed && mv ed 3blastX_hits_m6_header1.tab
head 3blastX_hits_m6_header1.tab | column -t
qseqid sseqid qlen slen qstart qend qframe sstart send length pident positive mismatch gaps evalue bitscore qcovs Am1 gnl|casseteDB|146151084 1706 273 887 1699 2 2 273 272 98.897 270 2 1 0.0 536 48 Am1 gnl|casseteDB|157674384 1706 270 899 1699 2 4 270 267 99.625 266 1 0 0.0 533 47 Am1 gnl|casseteDB|161087337 1706 268 899 1699 2 2 268 267 99.625 266 1 0 0.0 533 47 Ap18 gnl|casseteDB|90265366 1762 277 39 836 3 11 276 266 95.865 258 11 0 0.0 508 45 Ap18 gnl|casseteDB|84095097 1762 285 39 836 3 19 284 266 95.489 258 12 0 3.97e-180 506 45 Ap18 gnl|casseteDB|156628012 1762 285 39 836 3 19 284 266 95.113 257 13 0 2.43e-179 504 45 Ap26 gnl|casseteDB|156628012 945 285 16 870 1 1 285 285 99.649 285 1 0 0.0 573 90 Ap26 gnl|casseteDB|133905080 945 336 1 870 1 47 336 290 97.241 285 8 0 0.0 572 92 Ap26 gnl|casseteDB|84095097 945 285 16 870 1 1 285 285 99.649 284 1 0 0.0 572 90
As we saw in the output from the previous \(blastx\) run, the hits for each query sequence align only to a certain portion of the latter. We will need to substantially increase the number of hits to have a chance to get lower-scoring HSPs that align to other regions (smaller genes) of the DNA sequence used as query.
#>>> Lets try to improve the analysis by increasing the no. of hits for each sequence to 100
# and adding the stitle column to the output table
echo -e 'qseqid\tsseqid\tstitle\tqlen\tslen\tqstart\tqend\tsstart\tsend\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs\tqframe' > header2.tab
#>>> print a numvered list of the positional indexes of each column
sed 's/\t/\n/g' header2.tab | nl
1 qseqid
2 sseqid
3 stitle
4 qlen
5 slen
6 qstart
7 qend
8 sstart
9 send
10 length
11 pident
12 positive
13 mismatch
14 gaps
15 evalue
16 bitscore
17 qcovs
18 qframe
blastx -query 3cass_amplicons.fna -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-10 -outfmt '6 qseqid sseqid stitle qlen slen qstart qend sstart send length pident positive mismatch gaps evalue bitscore qcovs qframe' -num_alignments 100 -out 100blastX_hits_m6_header2.tab
[ -s 100blastX_hits_m6_header2.tab ] && cat header2.tab 100blastX_hits_m6_header2.tab > ed && mv ed 100blastX_hits_m6_header2.tab
# print relevant table fields
cut -f1,2,6,7,11,17,18 100blastX_hits_m6_header2.tab | column -t
qseqid sseqid qstart qend pident qcovs qframe Am1 gnl|casseteDB|146151084 887 1699 98.897 48 2 Am1 gnl|casseteDB|157674384 899 1699 99.625 47 2 Am1 gnl|casseteDB|161087337 899 1699 99.625 47 2 Am1 gnl|casseteDB|10444107 908 1699 100.000 46 2 Am1 gnl|casseteDB|21898659 908 1699 99.621 46 2 ... Am1 gnl|casseteDB|38491398 908 1693 72.137 46 2 Am1 gnl|casseteDB|45502091 908 1516 87.685 36 2 Am1 gnl|casseteDB|10444105 3 500 100.000 29 3 Am1 gnl|casseteDB|115605632 3 500 99.398 29 3 Am1 gnl|casseteDB|48526059 3 500 99.398 29 3 Am1 gnl|casseteDB|56541576 3 500 98.795 29 3 Am1 gnl|casseteDB|110084170 899 1678 60.769 46 2 Am1 gnl|casseteDB|161087327 3 488 100.000 28 3 Am1 gnl|casseteDB|110588863 908 1678 60.311 45 2 ... Am1 gnl|casseteDB|10444106 612 902 98.969 17 3 Am1 gnl|casseteDB|6003516 612 902 97.938 17 3 Am1 gnl|casseteDB|56541577 612 902 97.938 17 3 Am1 gnl|casseteDB|21898658 612 902 94.845 17 3 Am1 gnl|casseteDB|14091025 588 902 78.095 18 3 Am1 gnl|casseteDB|110588887 612 902 78.351 17 3 Am1 gnl|casseteDB|38492198 588 887 77.000 18 3 Am1 gnl|casseteDB|134035531 612 887 76.087 16 3 Am1 gnl|casseteDB|82698267 612 875 76.136 15 3 Am1 gnl|casseteDB|82698261 612 902 76.289 17 3 Am1 gnl|casseteDB|111661582 18 488 36.709 28 3 Am1 gnl|casseteDB|164449571 18 488 38.608 28 3 ...
What can we conclude from the selected output lines shown above for Am1 hits?
#>>> This 1liner generates a non-redundant list of the qstart qend positions for each of the query sequences.
# These are extracted from the full table with $(cut -f1 100blastX_hits_m6_header2.tab | grep -v qseqid | sort -u)
# Then, for each query, grep it out of the table with grep $query 100blastX_hits_m6_header2.tab and keep only the fields 6,7: qstart, qend
# We separte and label the results corresponding to each query sequence using the echo "#>>> $query"; as header
# End the corresponding rerport with echo '------------------------';
for query in $(cut -f1 100blastX_hits_m6_header2.tab | sed '1d' | sort -u)
do
echo "#>>> $query"
grep $query 100blastX_hits_m6_header2.tab | cut -f6,7 | sort -u
echo '------------------------'
done
##### OUTPUT of the for loop from the previous block #>>> Am1 1361 1699 15 488 18 488 33 488 3 488 3 500 588 887 588 902 612 875 612 887 612 902 887 1699 899 1678 899 1699 908 1303 908 1516 908 1678 908 1684 908 1693 908 1696 908 1699 920 1699 ------------------------ #>>> Ap18 1046 1762 1052 1762 1079 1528 1079 1702 1085 1738 1088 1738 1118 1738 1130 1738 1139 1738 36 836 39 812 39 836 48 173 48 443 48 656 48 797 48 812 48 830 48 836 48 860 501 836 60 836 ------------------------ #>>> Ap26 1 192 16 867 16 870 1 870 532 870 58 870 61 870 70 870 79 204 79 474 79 687 79 861 79 864 79 867 79 870 91 870 ------------------------
# Identify the most likely gene start|end positions
grep Am1 100blastX_hits_m6_header2.tab | cut -f6,7 | sort | uniq -c | sort -nk2
1 3 488
4 3 500
2 15 488
13 18 488
2 33 488
1 588 887
1 588 902
1 612 875
1 612 887
6 612 902
1 887 1699
1 899 1678
6 899 1699
1 908 1303
1 908 1516
2 908 1696
35 908 1699
3 908 1693
4 908 1684
6 908 1678
7 920 1699
1 1361 1699
#>>> Lets explore this in more detail. We'll start with the Am1 query
# One class of hits seems to start at position 3, ending at around position 500
# What are the hits? Lets grep them out of the table;
# note the use of a regular expression '3[[:space:]]500' which denotes a 3 followed by a space or tab,
# followed by a 500 (i.e. our qstart qend coordinates)
grep Am1 100blastX_hits_m6_header2.tab | grep '3[[:space:]]500' | cut -f1,3,6,7,11,17,18 | column -ts $'\t'
# Or somewhat more flexibly, using AWK, which will allow us to include the header in the output, as shown below
awk 'BEGIN{FS=OFS="\t"}NR==1 || (/^Am1/ && $6 == 3 && $7 == 500){print $1,$3,$6,$7,$17,$18}' Am1 100blastX_hits_m6_header2.tab | column -ts $'\t'
qseqid stitle qstart qend qcovs qframe Am1 |[Serratia marcescens]|dhfrXII|dihydrofolate reductase type XII|498|AF284063| 3 500 29 3 Am1 |[Acinetobacter baumannii]|dhrfXII|dihydrofolate reductase|498|DQ995286|19365688 3 500 29 3 Am1 |[Salmonella enterica subsp. enterica serovar Choleraesuis]|dhfr|Dhfr|498|AY551331|15145503 3 500 29 3 Am1 |[Enterococcus faecalis]|dhfrXII|dihydrofolate reductase|498|AB196348|16451182 3 500 29 3
#>>> Retrieve the best hit gnl|casseteDB|10444105 found by the search in the previous step
blastdbcmd -db integron_cassettes4blastdb.faaED -entry 'gnl|casseteDB|10444105' -dbtype prot -out S_marcescens_dhrXII_casseteDB-10444105.faa
#>>> Retrieve all hits found by the search in the previous step for the Am1 gene starting at pos 899 and ending at pos 1699
grep Am1 100blastX_hits_m6_header2.tab | grep '899[[:space:]]1699'
grep Am1 100blastX_hits_m6_header2.tab | grep '899[[:space:]]1699' | cut -f2 > Am1_qs899_qe1699_hit_IDs.list
blastdbcmd -db integron_cassettes4blastdb.faaED -entry_batch Am1_qs899_qe1699_hit_IDs.list -dbtype prot -out Am1_qs899_qe1699_AadA_best_hits.faa
As we’ve learned in the example above, we will need to explore many
alignments from a BLASTX search in order to make sure that we
identify all CDSs on a DNA string. The blast-imager.pl
script is a very convenient tool to visualize results, when we have
large lists of hits matching on different parts of our query sequence,
exactly as in our case.
To run the script, we will first split the multifasta of query sequences into individual sequences, to generate graphs the the hit table of individual queries.
# 1. If workin on tepeu or buluc, copy required scripts to your $HOME/bin directory
[ ! -d $HOME/bin ] && mkdir $HOME/bin
cp /home/vinuesa/cursos/intro_biocomp/scripts/*.* $HOME/bin
# 2. if you don't have access to the servers @ LCG, fetch the scripts directly from the command line
wget -c https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/master/blast-imager.pl
wget -c https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/master/fasta_toolkit.awk
wget -c https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/master/split_fasta.pl
chmod 755 blast-imager.pl fasta_toolkit.awk split_fasta.pl
# Split the multifasta into the individual source sequence files,
# that is, generate single fasta files from the multifasta source file
split_fasta.pl 3cass_amplicons.fna
ls -ltr | tail -3
-rw-r--r--. 1 vinuesa faculty 967 oct 18 21:54 Ap26.fa -rw-r--r--. 1 vinuesa faculty 1798 oct 18 21:54 Ap18.fa -rw-r--r--. 1 vinuesa faculty 1740 oct 18 21:54 Am1.fa
# Run the blast-imager.pl script
# i) first writing the tables to file
#for file in *fa
#do
# blastx -query $file -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-20 -outfmt 6 -num_alignments 100 -out ${file%.fa}_blastxout.tab;
# ./blast-imager.pl ${file%.fa}_blastxout.tab > ${file%.*}.png
# rm ${file%.fa}_blastxout.tab
#done
# ii) or directly to the graph, by piping the blastx oputput directly into the blast-imager script like so:
#for file in *fa; do blastx -query $file -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-20 -outfmt 6 -num_alignments 100 | ../../blast-imager.pl > ${file%.*}.png; done
# Run again, imposing some filtering to remove the poorest hits: pident > 98%; mismatches < 4; gaps < 3
# which will render a better figure
for file in *fa; do blastx -query $file -query_gencode 11 -db integron_cassettes4blastdb.faaED -evalue 1e-20 -outfmt 6 -num_alignments 100 | awk '$3 > 98 && $5 < 4 && $6 < 3' | blast-imager.pl > ${file%.*}.png; done
# Have a look at the results for the three query sequences using the "eye of GNOME - eog"
eog Am1.png &
eog Ap18.png &
eog Ap26.png &
# or all at once:
eog *.png &
Figure 3. The graphical output generated by
blast-imager.pl for the 100 Am1 blastx hits
The graphical display shown above in Fig. 3 and the numeric coordinates extracted previously from the hit table, strongly suggest that there are 3 gene cassettes encoded in the amplicon generated for Am1.
1 3 488
4 3 500
2 15 488
13 18 488
2 33 488
1 588 887
1 588 902
1 612 875
1 612 887
6 612 902
1 887 1699
1 899 1678
6 899 1699
1 908 1303
1 908 1516
2 908 1696
35 908 1699
3 908 1693
4 908 1684
6 908 1678
7 920 1699
1 1361 1699
Looking closely at the coordinates and figure 3, we can write the following table with the likely gene coordinates
Table 1. Coordinates for gene cassettes encoded in the Am1 amplicon DNA sequence.
| start | end | strand | note |
|---|---|---|---|
| 3 | 500 | + | - |
| 612 | 902 | + | The 887 and 899 starts would overlap with the first gene; this is not typical for gene cassettes |
| 908 | 1699 | + | - |
We will use the fasta_toolkit.awk script to extract the
three gene cassettes from the corresponding DNA string using the their
coordinates, as defined in Table 1.
fasta_toolkit.awk -R 3 -s 3 -e 500 Am1.fa
>Am1 ATGAACTCGGAATCAGTACGCATTTATCTCGTTGCTGCGATGGGAGCCAATCGGGTTATTGGCAATGGTCCTAATATCCCCTGGAAAATTCCGGGTGAGCAGAAGATTTTTCGCAGACTCACTGAGGGAAAAGTCGTTGTCATGGGGCGAAAGACCTTTGAGTCTATCGGCAAGCCTCTACCGAACCGTCACACATTGGTAATCTCACGCCAAGCTAACTACCGCGCCACTGGCTGCGTAGTTGTTTCAACGCTGTCGCACGCTATCGCTTTGGCATCCGAACTCGGCAATGAACTCTACGTCGCGGGCGGAGCTGAGATATACACTCTGGCACTACCTCACGCCCACGGCGTGTTTCTATCTGAGGTACATCAAACCTTCGAGGGTGACGCCTTCTTCCCAATGCTCAACGAAACAGAATTCGAGCTTGTCTCAACCGAAACCATTCAAGCTGTAATTCCGTACACCCACTCCGTTTATGCGCGTCGAAACGGCTAA
# write first CDS to Am1_cassette_CDSs.fna, and append with 2cnd and 3rd to it using >>
fasta_toolkit.awk -R 3 -s 3 -e 500 Am1.fa > Am1_cassette_CDSs.fna
fasta_toolkit.awk -R 3 -s 612 -e 902 Am1.fa >> Am1_cassette_CDSs.fna
fasta_toolkit.awk -R 3 -s 908 -e 1699 Am1.fa >> Am1_cassette_CDSs.fna
grep '^>' Am1_cassette_CDSs.fna
>Am1 >Am1 >Am1
perl -pe 'if(/^>(\S+)/){ $c++; $h=$1 . "_ORF" . $c; s/$1/$h/ }' Am1_cassette_CDSs.fna
perl -pe 'if(/^>(\S+)/){ $c++; $h=$1 . "_ORF" . $c; s/$1/$h/ }' Am1_cassette_CDSs.fna > ed && mv ed Am1_cassette_CDSs.fna
cat Am1_cassette_CDSs.fna
>Am1_ORF1 ATGAACTCGGAATCAGTACGCATTTATCTCGTTGCTGCGATGGGAGCCAATCGGGTTATTGGCAATGGTCCTAATATCCCCTGGAAAATTCCGGGTGAGCAGAAGATTTTTCGCAGACTCACTGAGGGAAAAGTCGTTGTCATGGGGCGAAAGACCTTTGAGTCTATCGGCAAGCCTCTACCGAACCGTCACACATTGGTAATCTCACGCCAAGCTAACTACCGCGCCACTGGCTGCGTAGTTGTTTCAACGCTGTCGCACGCTATCGCTTTGGCATCCGAACTCGGCAATGAACTCTACGTCGCGGGCGGAGCTGAGATATACACTCTGGCACTACCTCACGCCCACGGCGTGTTTCTATCTGAGGTACATCAAACCTTCGAGGGTGACGCCTTCTTCCCAATGCTCAACGAAACAGAATTCGAGCTTGTCTCAACCGAAACCATTCAAGCTGTAATTCCGTACACCCACTCCGTTTATGCGCGTCGAAACGGCTAA >Am1_ORF2 ATGTTTATTCAAACGGCATTTAGCTTTTCAGGCGTTATTCAGTGCCTGTTTTGCCTTTTTTCCGGGCTTCGCCTGCATGGGCTGCGCAGGTTTTCAGTCTTTTTGGCCTCTAGCCCTTGCGTAGCAAGCGCAAGCAGCTATCGTTTTTGCAGTGCTGTGCCGCCTCGGTGGCGCAGCGTTTTTTCACGGTTAGCGCCCGTCGCCAAATTCAAGTTATCCGTTTTGGCTTCTGGGTCTAACATTTCGGTCAAGCCGACCGCCATTCTGCGGTCGGCTTACCTCGCCCGTTAG >Am1_ORF3 ATGAGGGAAGCGGTGACCATCGAAATTTCGAACCAACTATCAGAGGTGCTAAGCGTCATTGAGCGCCATCTGGAATCAACGTTGCTGGCCGTGCATTTGTACGGCTCCGCAGTGGATGGCGGCCTGAAGCCATACAGCGATATTGATTTGTTGGTTACTGTGGCCGTAAAGCTTGATGAAACGACGCGGCGAGCATTGCTCAATGACCTTATGGAGGCTTCGGCTTTCCCTGGCGAGAGCGAGACGCTCCGCGCTATAGAAGTCACCCTTGTCGTGCATGACGACATCATCCCGTGGCGTTATCCGGCTAAGCGCGAGCTGCAATTTGGAGAATGGCAGCGCAATGACATTCTTGCGGGTATCTTCGAGCCAGCCATGATCGACATTGATCTAGCTATCCTGCTTACAAAAGCAAGAGAACATAGCGTTGCCTTGGTAGGTCCGGCAGCGGAGGAATTCTTTGACCCGGTTCCTGAACAGGATCTATTCGAGGCGCTGAGGGAAACCTTGAAGCTATGGAACTCGCAGCCCGACTGGGCCGGCGATGAGCGAAATGTAGTGCTTACGTTGTCCCGCATTTGGTACAGCGCAATAACCGGCAAAATCGCGCCGAAGGATGTCGCTGCCGACTGGGCAATAAAACGCCTACCTGCCCAGTATCAGCCCGTCTTACTTGAAGCTAAGCAAGCTTATCTGGGACAAAAAGAAGATCACTTGGCCTCACGCGCAGATCACTTGGAAGAATTTATTCGCTTTGTGAAAGGCGAGATCATCAAGTCAGTTGGTAAATGA
This operation can be easily performed with the help of
fasta_toolkit.awk
fasta_toolkit.awk -R 4 Am1_cassette_CDSs.fna > Am1_cassette_CDSs_translated.faa
cat Am1_cassette_CDSs_translated.faa
>Am1_ORF1 MNSESVRIYLVAAMGANRVIGNGPNIPWKIPGEQKIFRRLTEGKVVVMGRKTFESIGKPLPNRHTLVISRQANYRATGCVVVSTLSHAIALASELGNELYVAGGAEIYTLALPHAHGVFLSEVHQTFEGDAFFPMLNETEFELVSTETIQAVIPYTHSVYARRNG* >Am1_ORF2 MFIQTAFSFSGVIQCLFCLFSGLRLHGLRRFSVFLASSPCVASASSYRFCSAVPPRWRSVFSRLAPVAKFKLSVLASGSNISVKPTAILRSAYLAR* >Am1_ORF3 MREAVTIEISNQLSEVLSVIERHLESTLLAVHLYGSAVDGGLKPYSDIDLLVTVAVKLDETTRRALLNDLMEASAFPGESETLRAIEVTLVVHDDIIPWRYPAKRELQFGEWQRNDILAGIFEPAMIDIDLAILLTKAREHSVALVGPAAEEFFDPVPEQDLFEALRETLKLWNSQPDWAGDERNVVLTLSRIWYSAITGKIAPKDVAADWAIKRLPAQYQPVLLEAKQAYLGQKEDHLASRADHLEEFIRFVKGEIIKSVGK*
We will run BLASTP a final time to get the best functional annotations for the three ORFs (gene cassettes) identified via BLASTX in the Am1 amplicon.
# 1. run blastp with the translated products and get 10 top hits
echo -e 'qseqid\tsseqid\tqlen\tslen\tstitle\tlength\tpident\tpositive\tmismatch\tgaps\tevalue\tbitscore\tqcovs' > header1.tab
blastp -query Am1_cassette_CDSs_translated.faa -db integron_cassettes4blastdb.faaED -outfmt '6 qseqid sseqid qlen slen stitle length pident positive mismatch gaps evalue bitscore qcovs' -num_alignments 10 > blastp_10hits.out
cat header1.tab blastp_10hits.out > ed && mv ed blastp_10hits.out
# 2. reduce the output table to relevant fields for easier visualization
sed 's/\t/\n/g' header1.tab | cat -n
1 qseqid
2 sseqid
3 qlen
4 slen
5 stitle
6 length
7 pident
8 positive
9 mismatch
10 gaps
11 evalue
12 bitscore
13 qcovs
awk 'BEGIN{FS="\t"; OFS=","; print "qseqid,qlen,slen,stitle,pident,bitscore,qcovs"}{print $1,$3,$4,$5,$7,$12,$13}' blastp_10hits.out > Am1_top10_blastp_hits_for_3ORFs.csv
awk 'BEGIN{FS=OFS="\t"; print "qseqid\tqlen\tslen\tstitle\tpident\tbitscore\tqcovs"}{print $1,$3,$4,$5,$7,$12,$13}' blastp_10hits.out > Am1_top10_blastp_hits_for_3ORFs.tsv
qseqid,qlen,slen,stitle,pident,bitscore,qcovs Am1_ORF1,166,166,|[Serratia marcescens]dhfrXII|dihydrofolate reductase type XII|498|AF284063|unpubl,100.000,340,100 Am1_ORF1,166,166,|[Acinetobacter baumannii]|ZJB04|||dhrfXII|dihydrofolate reductase|498|DQ995286|Xu_H.|Curr.Microbiol.59_113-117_2009|19365688,99.398,338,100 Am1_ORF1,166,166,|[Salmonella enterica subsp. enterica serovar Choleraesuis]dhfr|Dhfr|498|AY551331|Huang_T.M.|Vet.Microbiol.100_247-254_2004|15145503,99.398,338,100 Am1_ORF1,166,166,|[Enterococcus faecalis]|||human male bone|China:GuangzhoudhfrXII|dihydrofolate reductase|498|AB196348|Su_J.|FEMSMicrobiol.Lett.254_75-80_2006|16451182,98.795,337,100 Am1_ORF1,166,178,|[Salmonella enterica subsp. enterica serovar Choleraesuis]|OU7519|hybrid of pSCV50 and pSC138||tri|Tri|534|EU219534|unpubl,100.000,333,98 Am1_ORF1,166,158,|[Escherichia coli]||Rs411b||dhfr17|dihydrofolate reductase|474|DQ875874|Ozugumus_O.B.|J.Microbiol.45_379-387_2007|17978796,36.709,108,95 Am1_ORF1,166,163,|[Escherichia coli]dfr17|dihydrofolate reductase|491|AY828551|Zhang_H.|Microbiol.Immunol.48_639-645_2004|15383699,36.709,108,95 Am1_ORF1,166,158,|[Acinetobacter baumannii]||607460||dfrVII|dihydrofolate reductase|474|EU340417|Xu_X.|Int.J.Antimicrob.Agents32_441-445_2008|18757181,38.608,107,95 Am1_ORF1,166,183,|[Escherichia coli]DfrA17||549|AM886293|unpubl,36.709,108,95 Am1_ORF1,166,158,|[Escherichia coli]||clinical isolate||dhfrIb|dihydrofolate reductase type Ib|474|Z50805|unpubl,37.342,105,95 Am1_ORF2,97,97,|[Serratia marcescens]|unknown|291|AF284063|unpubl,98.969,183,100 Am1_ORF2,97,97,|[Klebsiella pneumoniae]|KpS15unknown|291|AF180731|unpubl,97.938,183,100 Am1_ORF2,97,97,|[Enterococcus faecalis]|||human male bone|China:Guangzhou|putative protein|291|AB196348|Su_J.|FEMSMicrobiol.Lett.254_75-80_2006|16451182,97.938,181,100 Am1_ORF2,97,97,|[uncultured bacterium]|unknown|291|AY115474|Tennstedt_T.|FEMSMicrobiol.Ecol.45_239-252_2003|19719593,94.845,176,100 Am1_ORF2,97,97,|[uncultured bacterium]|||arable soil||hypothetical protein|291|DQ779005|Heuer_H|Environ.Microbiol.9_657-666_2007|17298366,78.351,137,100 Am1_ORF2,97,105,|[Aeromonas salmonicida subsp. salmonicida]|123-86 3101unknown|315|AF327731|L'Abee-Lund_T.M|Microb.DrugResist.7_263-272_2001|11759088,78.351,137,100 Am1_ORF2,97,101,|[Pseudomonas aeruginosa]|orfD|303|AY460181|unpubl,77.174,129,95 Am1_ORF2,97,93,|[Pseudomonas aeruginosa]||4677||MexicoorfD|hypothetical protein|279|EF184217|unpubl,76.087,127,95 Am1_ORF2,97,88,|[Salmonella enterica subsp. enterica serovar Enteritidis]||Homo sapiens||Brazil|unknown|266|DQ278190|Peirano_G.|J.Antimicrob.Chemother.58_305-309_2006|16782743,76.136,122,91 Am1_ORF2,97,97,|[Salmonella enterica subsp. enterica serovar Agona]||Homo sapiens||Brazil|unknown|291|DQ278189|Peirano_G.|J.Antimicrob.Chemother.58_305-309_2006|16782743,76.289,115,100 Am1_ORF3,264,273,|[Klebsiella pneumoniae]|NK29aadA2|819|EF382672|Chen_Y.T.|Antimicrob.AgentsChemother.51_3004-3007_2007|17526756,100.000,531,100 Am1_ORF3,264,264,|[Serratia marcescens]aadA2|streptomycin/spectinomycin adenyltransferase|792|AF284063|unpubl,100.000,530,100 Am1_ORF3,264,264,|[uncultured bacterium]aadA2|streptomycin/spectinomycin adenylyltransferase|792|AY115474|Tennstedt_T.|FEMSMicrobiol.Ecol.45_239-252_2003|19719593,99.621,529,100 Am1_ORF3,264,264,|[Aeromonas hydrophila]|CVM861||diarrheic pig|aada2|aminoglycoside adenyltransferase|792|AY522923|Poole_T.L.|J.Antimicrob.Chemother.57_31-38_2006|16339607,99.621,529,100 Am1_ORF3,264,268,|[Salmonella enterica subsp. enterica serovar Choleraesuis]|OU7519|hybrid of pSCV50 and pSC138||aadA2|AadA2|804|EU219534|unpubl,99.621,528,100 Am1_ORF3,264,270,|[Proteus mirabilis]|C04014|||aadA2|AadA2|810|EU006711|Boyd_D.A.|Antimicrob.AgentsChemother.52_340-344_2008|18025121,99.621,528,100 Am1_ORF3,264,264,|[Proteus mirabilis]aadA2|aminoglycoside adenyltransferase|792|AY681136|unpubl,99.621,528,100 Am1_ORF3,264,264,|[uncultured bacterium]|||arable soil|aadA2|aminoglycoside-3'-adenylyltransferase|792|DQ779004|Heuer_H|Environ.Microbiol.9_657-666_2007|17298366,99.621,527,100 Am1_ORF3,264,264,|[Salmonella typhimurium DT104]|H3380|Salmonella typhimurium phage type DT104||aadA2|streptomycin/spectinomycin adenyltransferase|792|AF071555|Briggs_C.E|Antimicrob.AgentsChemother.43_846-849_1999|10103189,99.242,526,100 Am1_ORF3,264,264,|[Escherichia coli]|W21-3||sewage of swine farm|China: GuangzhouaadA2|streptomycin/spectinomycin 3' adenyltransferase|792|DQ157750|unpubl,98.864,522,100
Table 2 Annotation of the gene-cassettes found in the Am1 amplicon
| ORF | start | end | strand | gene | annotation |
|---|---|---|---|---|---|
| 1 | 3 | 500 | + | dhfrXII | dihydrofolate reductase type XII |
| 2 | 612 | 902 | + | orfD | hypothetical protein |
| 3 | 908 | 1699 | + | aadA2 | aminoglycoside adenyltransferase type II |
Figure 4. Prevalence of class-1 integrons and distribution of identified gene cassette arrays among Enterobacteriaceae isolates recovered from clean and contaminated rivers in Morelos, Mexico. Results from Jazmín Ramos Madrigal’s bachelor thesis @lcgunam.
Repeat the BLASTX based identification of gene cassettes for the Ap18 and Ap26 amplicons, extract the corresponding CDSs, translate them and annotate them with BLASTP, writing a table with the single-best annotation for each ORF.